ABSTRACT

We design a space efficient algorithm that approximates the 
transitivity (global clustering coefficient) and total triangle 
count with only a single pass through a graph given as a 
stream of edges. Our procedure is based on the classic prob- 
abilistic result, the birthday paradox. When the transitivity 
is constant and there are more wedges than edges (common
properties for social networks), we can prove that our algorithm
requires $O(\sqrt{n})$ space ($n$ is the number of vertices) to
provide accurate estimates. We run a detailed set of experi- 
ments on a variety of real graphs and demonstrate that the 
memory requirement of the algorithm is a tiny fraction of 
the graph. For example, even for a graph with 200 million 
edges, our algorithm stores just 60,000 edges to give accu- 
rate results. Being a single pass streaming algorithm, our 
procedure also maintains a real-time estimate of the transitivity
and number of triangles of a graph, by storing a minuscule
fraction of edges.

Categories and Subject Descriptors 

F.2.2 [Nonnumerical Algorithms and Problems]: Com- 
putations on Discrete Structures; G.2.2 [Graph Theory]: 
Graph Algorithms 

General Terms 

Algorithms, Theory 
1. INTRODUCTION

Triangles are one of the most important motifs in real world 
networks. Whether the networks come from social interac- 
tion, computer communications, financial transactions, pro- 
teins, or ecology, the abundance of triangles is pervasive and 
is a critical feature that distinguishes real graphs from random
graphs. There is a rich body of literature on analysis
of triangles and counting algorithms. Social scientists use 
triangle counts to understand graphs [16, 26, 10, 37]; graph 
mining applications such as spam detection and finding com- 
mon topics on the WWW use triangle counts [17, 6]; motif 
detection in bioinformatics often counts the frequency of triadic
patterns [23]. Nevertheless, counting triangles continues to be a
challenge due to the sheer size of the graphs (easily on the order
of billions of edges).

Many massive graphs come from modeling interactions in 
a dynamic system. People call each other on the phone, 
exchange emails, or become part of a tight unit (e.g., co-authoring
a paper); computers exchange messages; animals
come in the vicinity of each other; companies trade with 
each other. These interactions manifest as a stream of edges. 
The edges appear with timestamps, or "one at a time." The 
network (graph) that represents the system is an accumu- 
lation of the observed edges. There are many methods to 
deal with such massive graphs, such as random sampling [27,
34, 29], the MapReduce paradigm [30, 25], distributed-memory
parallelism [3, 11], external-memory algorithms [12, 2], and
multithreaded parallelism [8].

But all of these methods need to store at least a large fraction
of the data. A small-space streaming algorithm instead maintains,
at any given time, a very small randomly chosen set of edges called
the "sketch", and updates this sample as edges appear. Based on
the sketch and some auxiliary data structures, the algorithm
computes an accurate
estimate for the number of triangles for the graph seen so 
far. The sketch size is orders of magnitude smaller than the 
total graph. Furthermore, it can be updated rapidly when 
new edges arrive, and hence maintains a real-time estimate 
of the number of triangles. We also want a single pass al- 
gorithm, so it only observes each edge once (think of it as 
making a single scan of a log file). The algorithm cannot
revisit edges that it has forgotten.

Figure 1: Real-time tracking of the number of triangles and the transitivity on cit-Patents (16M edges), storing only 60K edges from
the past. (Absolute values are shown with time (in years) on the x-axis.)

1.1 The streaming setting 

Let G be a simple undirected graph with n vertices and m 
edges. Let T denote the number of triangles in the graph and 
W be the number of wedges, where a wedge is a path of length 
2. A common measure is the transitivity $\kappa = 3T/W$ [36],
a measure of how often friends of friends are also friends. 
(This is also called the global clustering coefficient [29].) 
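
To make the definitions concrete, the following minimal sketch computes $T$, $W$, and the transitivity exactly for a small graph held in memory. It is only an illustration of the quantities being estimated, not part of the streaming algorithm; the function name `exact_stats` is an assumption introduced here.

```python
from collections import defaultdict
from itertools import combinations


def exact_stats(edges):
    """Return (T, W, kappa) for a simple undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # A wedge is a path of length 2: an unordered pair of neighbors of a center.
    W = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # Every triangle contributes one closed wedge per center vertex (three in total).
    closed = sum(1 for c in adj for a, b in combinations(adj[c], 2) if b in adj[a])
    T = closed // 3
    kappa = 3 * T / W if W else 0.0
    return T, W, kappa
```

For a triangle with one pendant edge attached, there are five wedges, three of which are closed, so the transitivity is 3/5.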

Formally, a single pass streaming algorithm is defined as follows.
Consider a sequence of (distinct) edges $e_1, e_2, \ldots, e_m$.
Let $G_t$ be the graph at time $t$, formed by the edge set
$\{e_i \mid i \le t\}$. The stream of edges can be thought of as a
sequence of edge insertions into the graph. (Vertex insertions
can be trivially handled.) We do not know the number of 
vertices ahead of time, and simply see each edge as a pair 
(u, v) of vertex labels. So new vertices are implicitly added 
as new labels. There is no assumption on the order of edges 
in the stream. Edges incident to a single vertex do not nec- 
essarily appear together. 

In this paper, we do not consider edge/vertex deletions or
repeated edges. In that sense, this is a simplified version of 
the full-blown streaming model. Nonetheless, the edge inser- 
tion model is what is studied in previous work on counting 
triangles [5, 19, 9, 1, 20]. 

A streaming algorithm has a small memory M, and sees 
the edges in stream order. At each edge $e_t$, the algorithm
can choose to update data structures in M (using the edge
$e_t$). Then the algorithm proceeds to $e_{t+1}$, and so on. The
algorithm is never allowed to see an edge that has already
passed by. The memory M is much smaller than $m$, so the
algorithm keeps a small "sketch" of the edges it has seen. The
aim is to estimate the number of triangles in $G = G_m$ at
the end of the stream. Usually, we desire the more stringent
guarantee of maintaining a running estimate of the number of
triangles and the transitivity of $G_t$ at time $t$. We denote these
quantities respectively as $T_t$ and $\kappa_t$.

1.2 Results 



We present a single pass, $O(m/\sqrt{T})$-space algorithm to provably
estimate the transitivity (with arbitrary additive error)
in a streaming graph. Streaming algorithms for counting triangles
or computing the transitivity have been studied before,
but no previous algorithm attains this space guarantee.
(Buriol et al. [9] give a single pass algorithm with a stronger
relative error guarantee that requires space $O(mn/T)$. We
discuss this in more detail later.)

Although our theoretical result is interesting asymptotically, 
the constant factors and dependence on error in our bound 
are large. Our main result is a practical streaming algorithm 
(based on the theoretical one) for computing $\kappa$ and $T$, using
some additional probabilistic heuristics. We perform an
extensive empirical analysis of our algorithm on a variety 
of datasets (many thanks to SNAP [39], for their extensive 
collection of downloadable graphs). The salient features of 
our algorithm are: 

• Theoretical basis: Our algorithm is based on the classic 
birthday paradox: if we choose 23 random people, the probability
that 2 of them share a birthday is at least 1/2 (Chap.
II.3 of [18]). We extend this analysis to sampling wedges
from a large pool of edges. The final streaming algorithm is
designed by using reservoir sampling with wedge sampling
[29] for estimating $\kappa$. We prove a space bound of $O(m/\sqrt{T})$,
which we show is $O(\sqrt{n})$ under common conditions for social
networks.

While our theory appears to be a good guide in designing the 
algorithm and explaining its behavior, it should not be used 
to actually decide space bounds in practice. Also, we point 
out that for graphs where $T$ is small, our algorithm does
not provide good guarantees for small space (since $m/\sqrt{T}$ is
large).

• Accuracy and scalability with small sketches: We
test our algorithm on a variety of graphs from different
sources. In all instances, we get accurate estimates for $\kappa$
and $T$ by storing at most 40K edges. This holds even for graphs
where $m$ is on the order of millions. Our relative errors on $\kappa$
and the number of triangles are mostly less than 5%. (In a
graph with very few triangles where $\kappa < 0.01$, our triangle
estimate has relative error of 12%.) Our algorithm processes
extremely large graphs. Our experiments include a run on a
streamed Orkut social network with 200M edges (by storing
only 40K edges, relative errors are at most 4%). We get
similar results on streamed Flickr and LiveJournal graphs
with tens of millions of edges.

We run detailed experiments on some test graphs (with 1-3 
million edges) with varying parameters to show convergence 
of our algorithm. Comparisons with previous work [9] show 
that our algorithm gets within 5% of the true answer, while 
the previous algorithm is off by more than 50%. 

• Real-time tracking: For a temporal graph, our algorithm
closely tracks both $\kappa_t$ and $T_t$ with little storage. By
storing 60K edges of the past, we can track this informa- 
tion for a patent citation network with 16 million edges [39]. 
Refer to Fig. 1. We maintain a real-time estimate of both 
the transitivity and number of triangles with a single pass, 
storing less than 1% of the graph. We see some fluctuations 
in the transitivity estimate due to the randomness of the 
algorithm, but the overall tracking is fairly accurate. 

2. PREVIOUS WORK 

Enumeration of all triangles is a well-studied problem [13,
28, 22, 7, 14]. Recent work by Cohen [15], Suri and Vassilvitskii
[30], and Arifuzzaman et al. [3] gives massively parallel
implementations of these algorithms. Eigenvalue/trace based
methods have also been used [31, 4] to compute estimates 
of the total and per-degree number of triangles. 

Tsourakakis et al. [32] started the use of sparsification meth- 
ods, the most important of which is Doulion [34]. Various 
analyses of this algorithm (and its variants) have been pro- 
posed [21, 33, 38, 24]. Wedge-sampling based algorithms 
provide provably accurate estimates of various triadic measures
on graphs [27, 29].

Theoretical streaming algorithms for counting triangles were 
initiated by Bar-Yossef et al. [5]. Subsequent improvements
were given in [19, 9, 1, 20]. The space bounds achieved
are of the form $mn/T$. Note that $m/\sqrt{T} \le mn/T$ whenever
$T \le n^2$ (which is a reasonable assumption for sparse graphs).
These algorithms are rarely practical, since $T$ is often much
smaller than $mn$. (Some multi-pass streaming algorithms
give stronger guarantees, but we will not discuss them here.) 

Buriol et al. [9] give an implementation of their algorithm. 
For almost all of their experiments on graphs, with storage
of 100K edges, they get fairly large errors (always more
than 10%, and often more than 50%). Buriol et al. also provide
an implementation in the incidence list setting, where all
neighbors of a vertex arrive together. In this case, their
algorithm is quite practical since the errors are quite small.
Our algorithm scales to sizes (100 million edges) larger than 
their experiments. We get better accuracy with far less stor- 
age, without any assumption on the ordering of the data 
stream. Furthermore, our algorithm performs accurate real- 
time tracking. 

Becchetti et al. [6] gave a semi-streaming algorithm for counting
the triangles incident to every vertex. Their algorithm
uses clever methods to approximate Jaccard similarities, and
requires multiple passes over the data.

3. OUTLINE 

We begin in §4 by providing an intuitive high-level explana- 
tion of the algorithm. §5 gives a formal description of our 
implemented algorithm (denoted Streaming-Triangles). 
The theoretical analysis is done in §6, for an idealized (and
inefficient) variant called Single-Bit. We stress that Single-Bit
is a thought experiment to highlight the theoretical aspects
of our result, and we do not actually implement it.
In §6.1, we explain the heuristics used to get Streaming-Triangles.
The remainder of §6 gives a mathematical analysis
of Single-Bit.

Finally, in §7, we give results on our runs of Streaming- 
Triangles on real graphs. 

4. INTUITION FOR THE ALGORITHM 

The starting point is the idea of wedge sampling to estimate
$\kappa$ [29]. A wedge is closed if it participates in a triangle
and open otherwise. Note that $\kappa = 3T/W$ is exactly the
probability that a uniform random wedge is closed. This
gives a simple randomized algorithm for estimating $\kappa$ (and
$T = \kappa W/3$), by generating a set of (independent) uniform random
wedges and finding the fraction that are closed. But how do 
we sample wedges from a stream of edges? 
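
For a graph that fits in memory (not the streaming setting this paper targets), the wedge-sampling estimator just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the `seed` parameter are introduced here, and a uniform wedge is drawn by picking a center with probability proportional to its number of wedges and then a uniform pair of its neighbors.

```python
import random
from collections import defaultdict


def sample_wedge_transitivity(edges, k, seed=0):
    """Estimate the transitivity as the closed fraction of k uniform random wedges."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    centers = [v for v in adj if len(adj[v]) >= 2]
    if not centers:
        return 0.0  # the graph has no wedges at all
    # A vertex of degree d is the center of d*(d-1)/2 wedges.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for c in rng.choices(centers, weights=weights, k=k):
        a, b = rng.sample(list(adj[c]), 2)
        closed += b in adj[a]  # the wedge a-c-b is closed iff edge (a, b) exists
    return closed / k
```

Multiplying the returned fraction by $W/3$ gives the corresponding estimate of $T$.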

Suppose we just sampled a uniform random set of edges. 
How large does this set need to be to get a wedge? The birthday
paradox can be used to deduce that (as long as $W \ge m$,
pretty much always true for real networks) $O(\sqrt{n})$ edges
suffice. A more sophisticated result, given in Lem. 1, provides
(weak) concentration bounds on the number of wedges
generated by a random set of edges. A "small" number of
uniform random edges can give enough wedges to perform
wedge sampling (which in turn is used to estimate $\kappa$).

A set of uniform random edges can be maintained by the 
standard method of reservoir sampling [35]. From these 
edges, we generate a random wedge by doing a second level 
of reservoir sampling. This implicitly treats the wedges cre- 
ated in the edge reservoir as a stream, and performs reser- 
voir sampling on that. Overall, this approximates uniform 
random wedge sampling. 
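
As a point of reference, the classic form of reservoir sampling (often called Algorithm R) can be sketched as below. This is an illustrative assumption rather than the exact variant used in this paper: Algorithm R keeps a uniform sample without replacement, whereas the procedures in §5 and §6 overwrite reservoir slots so that the sample behaves like draws with replacement.

```python
import random


def reservoir_sample(stream, size, seed=0):
    """Keep a uniform random sample of `size` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= size:
            reservoir.append(item)  # the first `size` items are always kept
        else:
            j = rng.randrange(t)    # uniform index in [0, t)
            if j < size:            # happens with probability size / t
                reservoir[j] = item
    return reservoir
```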

As we maintain our reservoir wedges, we check for closure 
by the future edges in the stream. But there are closed 
wedges that cannot be verified, because the closing edge may 
have already appeared in the past. A simple observation 
used by past streaming algorithms saves the day [19, 9]. In 
each triangle, there is exactly one wedge whose closing edge 
appears in the future. So we try to approximate the fraction 
of these "future closed" wedges, which is exactly one-third 
of the fraction of closed wedges. (Hence, the factor 3 that 
pops up in Streaming-Triangles.) 

Finally, to estimate $T$ from $\kappa$, we need an estimate of the
total number of wedges W. This can be obtained by re- 
verse engineering the birthday paradox: given the number 
of wedges in our reservoir sample of edges, we can estimate 
W (again, using the workhorse Lem. 1). 
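
The "reverse engineering" step can be written down directly from the expectation in Lem. 1: a reservoir of $s$ uniform random edges out of $t$ is expected to contain $s(s-1)W/t^2$ wedges, so the observed count is rescaled by $t^2/(s(s-1))$. The sketch below only illustrates this arithmetic; both function names are assumptions introduced here.

```python
def expected_wedges_in_sample(s, W, m):
    """Expected number of wedges among s uniform random edges of a graph
    with m edges and W wedges (first part of Lem. 1 with S = all wedges)."""
    return s * (s - 1) * W / m ** 2


def estimate_total_wedges(tot_wedges, s, t):
    """Invert the expectation: estimate W_t from the wedge count of the reservoir."""
    return t ** 2 * tot_wedges / (s * (s - 1))
```

Taking $s = km/\sqrt{W}$ in the first function gives roughly $k^2$ wedges, which is the birthday-paradox scaling used throughout.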



5. THE PROCEDURE Streaming-Triangles 

The streaming algorithm maintains two primary sets of data: 
the edge reservoir and the wedge reservoir. These are sets 
of edges and wedges that the algorithm stores from the past. 
The parameters for the streaming algorithm are the respective
sizes of these sets, denoted by $s_e$ and $s_w$ respectively.
The main algorithm is described in Streaming-Triangles,
although most of the technical computation is performed
in Update (which is invoked every time a new edge appears).
After processing edge $e_t$, the algorithm computes
running estimates for $\kappa_t$ and $T_t$. These values do not have
to be stored, so they are immediately output and forgotten.
We describe the main data structures of the algorithm
Streaming-Triangles.

• Array edge_res[1 ... $s_e$]: This is the array of reservoir
edges and is the primary subsample of the stream maintained.

• New wedges $\mathcal{N}_t$: This is a list of all wedges involving
$e_t$ formed only by edges in edge_res. This may often be
empty, if $e_t$ is not in edge_res. We do not necessarily
maintain this list explicitly, and we discuss implementation
details later.

• Variable tot_wedges: This is the total number of wedges
formed by edges in edge_res.

• Array wedge_res[1 ... $s_w$]: This is an array of reservoir
wedges of size $s_w$.

• Array isClosed[1 ... $s_w$]: This is a boolean array. We
set isClosed[i] to be true if wedge wedge_res[i] is found
to be closed.



Algorithm 1: STREAMING-TRIANGLES($s_e$, $s_w$)

Initialize edge_res of size $s_e$ and wedge_res of size $s_w$.
For each edge $e_t$ in stream,
    Call Update($e_t$).
    Let $\rho$ be the fraction of entries in isClosed set to true.
    Set $\kappa_t = 3\rho$.
    Set $T_t = [\rho t^2 / (s_e(s_e - 1))] \times$ tot_wedges.



Algorithm 2: UPDATE($e_t$)

For every reservoir wedge wedge_res[i] closed by $e_t$, set
isClosed[i] = true.
Flip a coin with heads probability $1 - (1 - 1/t)^{s_e}$. If it
flips to tails, stop and proceed to the next edge in stream.
(So the remaining code is processed only if the coin comes up heads.)
Choose a uniform index $i \in [s_e]$. Set edge_res[i] = $e_t$.
Determine $\mathcal{N}_t$ and let new_wedges = $|\mathcal{N}_t|$.
Update tot_wedges, the number of wedges formed by edge_res.
Set q = new_wedges / tot_wedges.
For each index $i \in [s_w]$,
    Flip a coin with heads probability q.
    If tails, continue to next index in loop.
    Pick uniform random $w \in \mathcal{N}_t$ that involves $e_t$.
    Replace wedge_res[i] = w. Reset isClosed[i] = false.
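
The two procedures above can be sketched in Python as follows. This is a minimal, unoptimized sketch rather than the implementation used in the experiments: it assumes $s_e \ge 2$, represents a wedge as a tuple (a, center, b), and scans the whole edge reservoir on each update instead of maintaining it as a subgraph with neighbor lists. The names `make_wedge`, `StreamingTriangles`, `update`, and `estimates` are assumptions introduced for the illustration.

```python
import random


def make_wedge(e, f):
    """Return (a, center, b) if edges e and f share exactly one vertex, else None."""
    shared = set(e) & set(f)
    if len(shared) != 1:
        return None
    (c,) = shared
    a = e[0] if e[1] == c else e[1]
    b = f[0] if f[1] == c else f[1]
    return (a, c, b)


class StreamingTriangles:
    def __init__(self, s_e, s_w, seed=0):
        self.s_e, self.s_w = s_e, s_w
        self.rng = random.Random(seed)
        self.edge_res = [None] * s_e     # reservoir edges
        self.wedge_res = [None] * s_w    # reservoir wedges
        self.is_closed = [False] * s_w
        self.tot_wedges = 0              # wedges formed by edge_res
        self.t = 0                       # number of edges seen so far

    def update(self, edge):
        self.t += 1
        # Mark reservoir wedges whose two end vertices are joined by the new edge.
        for i, w in enumerate(self.wedge_res):
            if w is not None and {w[0], w[2]} == set(edge):
                self.is_closed[i] = True
        # Tails, with probability (1 - 1/t)^{s_e}: the edge reservoir is untouched.
        if self.rng.random() >= 1 - (1 - 1 / self.t) ** self.s_e:
            return
        i = self.rng.randrange(self.s_e)
        others = [f for j, f in enumerate(self.edge_res) if j != i and f is not None]
        old = self.edge_res[i]
        if old is not None:
            # Wedges that used the evicted edge leave the count.
            self.tot_wedges -= sum(make_wedge(old, f) is not None for f in others)
        self.edge_res[i] = edge
        new_wedges = [w for w in (make_wedge(edge, f) for f in others) if w is not None]
        self.tot_wedges += len(new_wedges)
        if self.tot_wedges == 0:
            return
        q = len(new_wedges) / self.tot_wedges
        for k in range(self.s_w):
            if self.rng.random() < q:
                self.wedge_res[k] = self.rng.choice(new_wedges)
                self.is_closed[k] = False

    def estimates(self):
        """Return the running estimates (kappa_t, T_t)."""
        rho = sum(self.is_closed) / self.s_w
        kappa = 3 * rho
        triangles = rho * self.t ** 2 / (self.s_e * (self.s_e - 1)) * self.tot_wedges
        return kappa, triangles
```

A caller would feed edges one at a time with `update((u, v))` and read `estimates()` whenever a running value is needed. Wedges whose edges have been evicted from edge_res are deliberately left in wedge_res, mirroring the heuristic discussed in §6.1.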



5.1 Implementation details 

Computing $\kappa_t$ and $T_t$ is simple and requires no overhead.
We maintain edge_res as a time-variable subgraph. Each
time edge_res is updated, the subgraph undergoes an edge
insert and an edge delete. Suppose $e_t = (u, v)$. Wedges in
$\mathcal{N}_t$ are given by the neighbors of $u$ and $v$ in this subgraph.
From random access to the neighbor lists of $u$ and $v$, we can
generate a random wedge from $\mathcal{N}_t$ efficiently.

Updates to edge_res are very infrequent. At time $t$, the probability
of an update is $1 - (1 - 1/t)^{s_e}$. By linearity of expectation,
the expected number of updates (and hence of times that
$\mathcal{N}_t$ can be non-empty) is

$$\sum_{t \le m} \left[1 - (1 - 1/t)^{s_e}\right] \approx \sum_{t \le m} s_e/t \approx s_e \ln m.$$

For a fixed $s_e$, this increases very slowly with $m$. So for most
steps, we neither update edge_res nor sample a new wedge.
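
The sum above is easy to evaluate numerically, which gives a feel for how rarely the edge reservoir changes. The helper below is a small sketch (its name is an assumption); since each term is at most $s_e/t$, the result never exceeds $s_e$ times the $m$-th harmonic number, which is at most $s_e(\ln m + 1)$.

```python
import math


def expected_updates(s_e, m):
    """Expected number of stream positions t <= m at which edge_res is updated."""
    return sum(1 - (1 - 1 / t) ** s_e for t in range(1, m + 1))


def harmonic_bound(s_e, m):
    """Upper bound s_e * (ln m + 1) on the expected number of updates."""
    return s_e * (math.log(m) + 1)
```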

The total number of edges that are stored from the past is
$s_e + s_w$. The edge reservoir explicitly stores edges, and at
most $s_w$ edges are implicitly stored (for closure). Regardless
of the implementation, the extra data structures overhead is
at most twice the storage parameters $s_e$ and $s_w$. Since these
are at least 2 orders of magnitude smaller than the graph,
this overhead is easy to pay for.

6. THE IDEALIZED ALGORITHM Single-Bit

The algorithm called Single-Bit is an idealized variant of
Streaming-Triangles that we can formally prove theorems
about. It requires more memory and expensive updates,
but explains the basic principles behind our algorithm.
We later give the memory-reducing heuristics that
take us from Single-Bit to Streaming-Triangles.

The procedure Single-Bit has a single space parameter
$s$ and outputs a single (random) bit at each $t$. The expectation
of this bit is related to the transitivity $\kappa_t$. We
will describe Single-Bit in a more mathematical fashion.
Single-Bit maintains a set of edges $\mathcal{R}_t$ of size $s$. The
set of wedges constructed from $\mathcal{R}_t$ is $\mathcal{W}_t$. Formally,
$\mathcal{W}_t = \{\text{wedge } (e, e') \mid e, e' \in \mathcal{R}_t\}$. For every $e_t$, Single-Bit records
all wedges closed by this edge. Single-Bit maintains a set
$\mathcal{C}_t$, the set of wedges in $\mathcal{W}_t$ for which it has detected a closing
edge. (Note that this is a subset of all closed wedges in $\mathcal{W}_t$.)
This set is easy to update as $\mathcal{R}_t$ changes.



Algorithm 3: Single-Bit($s$)

Initialize $\mathcal{R}_0$ as a set of $s$ dummy edges.
For each $e_t$ in stream,
    For each edge in $\mathcal{R}_{t-1}$, replace it (independently) by $e_t$
    with probability $1/t$. This yields $\mathcal{R}_t$.
    Construct the set of wedges $\mathcal{W}_t$.
    Let $\mathcal{C}_t$ be the set of all wedges in $\mathcal{W}_t$ closed by $e_t$.
    Add all wedges in $(\mathcal{C}_{t-1} \cap \mathcal{W}_t)$ to $\mathcal{C}_t$.
    If $\mathcal{W}_t$ is empty, output $b_t = 0$ and continue to next edge.
    Pick a uniform random wedge in $\mathcal{W}_t$.
    Output $b_t = 1$ if this wedge is in $\mathcal{C}_t$ and $b_t = 0$ otherwise.
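
Although Single-Bit is a thought experiment, a direct transcription helps in following the analysis. The sketch below is deliberately naive (quadratic work per edge) and makes two assumptions that are not spelled out in the pseudocode: dummy edges are represented by `None`, and $\mathcal{W}_t$ is enumerated as pairs of reservoir slots, which matches the counting used in Lem. 1. The function name and `seed` parameter are introduced here.

```python
import random
from itertools import combinations


def single_bit(stream, s, seed=0):
    """Return the list of bits b_1, b_2, ... output by an unoptimized Single-Bit."""
    rng = random.Random(seed)
    slots = [None] * s   # R_t; None plays the role of a dummy edge
    detected = set()     # C_t; a wedge is stored as a frozenset of its two edges
    bits = []
    for t, edge in enumerate(stream, start=1):
        e = frozenset(edge)
        # Each slot is independently overwritten by e_t with probability 1/t.
        slots = [e if rng.random() < 1 / t else old for old in slots]
        # W_t: pairs of slots holding edges that share exactly one vertex.
        pairs = [(a, b) for a, b in combinations(slots, 2)
                 if a is not None and b is not None and len(a & b) == 1]
        wedges = [frozenset(p) for p in pairs]
        # A wedge {a, b} is closed by e_t when e_t joins its two end vertices.
        closed_now = {frozenset((a, b)) for a, b in pairs if a ^ b == e}
        detected = (detected & set(wedges)) | closed_now
        if not wedges:
            bits.append(0)
            continue
        bits.append(1 if rng.choice(wedges) in detected else 0)
    return bits
```

On a stream with no triangles, no wedge is ever closed, so every output bit is 0.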



For convenience, we state this theorem for the final time 
step. However, it also holds (with an identical proof) for 
any large enough time t. 



Theorem 1. Assume $W \ge m$. Suppose $s \ge cm/(\beta^3\sqrt{T})$,
for some sufficiently large constant $c$. Set $\mathrm{est} = m^2|\mathcal{W}_m|/(s(s-1))$.
Then $|\kappa/3 - \mathbf{E}[b_m]| < \beta$ and with probability $> 1 - \beta$,
$|W - \mathrm{est}| < \beta W$.

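We now show that $m/\sqrt{T} = O(\sqrt{n})$ (usually much smaller
for heavy tailed graphs) when $W \ge m$. Denote the degree of
vertex $v$ by $d_v$. In this case, we can bound
$2W = \sum_v d_v(d_v - 1) = \sum_v d_v^2 - 2m \ge \sum_v d_v^2 - 2W$,
so $W \ge \sum_v d_v^2/4$. By $2m = \sum_v d_v$ and the Cauchy-Schwarz inequality,

$$\frac{m}{\sqrt{W}} \le \frac{\sum_v d_v}{\sqrt{\sum_v d_v^2}} \le \sqrt{n}.$$

Using the above bound and $T = \kappa W/3$, we get

$$\frac{m}{\sqrt{T}} = \frac{m}{\sqrt{W}} \cdot \sqrt{\frac{W}{T}} \le \sqrt{3n/\kappa}.$$

Hence, when $W \ge m$ and $\kappa$ is a constant (both
reasonable assumptions for social networks), we require only
$O(\sqrt{n})$ space.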

6.1 Circumventing problems with Single-Bit 

Single-Bit has two main problems, which are handled in 
Streaming-Triangles by heuristic arguments. 

Thm. 1 immediately gives a small sublinear space streaming
algorithm for estimating $\kappa$. The output of Single-Bit
has the right expectation. We can run many independent
invocations of Single-Bit and take the fraction of 1s output
to estimate $\mathbf{E}[b_m]$ (which in turn is close to $\kappa/3$). A
Chernoff bound tells us that $O(1/\epsilon^2)$ invocations suffice to
estimate $\mathbf{E}[b_m]$ within additive error $\epsilon$. The total space
becomes $O(m/(\sqrt{T}\epsilon^2))$, which can be very expensive in practice.
Even though $m/\sqrt{T}$ is not large, for the reasonable
value of $\epsilon = 0.01$, the storage cost blows up by a factor of
$1/\epsilon^2 = 10^4$. This is the standard method used in previous work for
streaming triangle counts.

This blowup is avoided in Streaming-Triangles by reusing
the same reservoir of edges for sampling wedges. Note that
Single-Bit is trying to generate a single uniform random
wedge from $G$, and repeating it requires independent reservoirs
of edges to generate multiple samples. Lem. 1 says that for a reservoir of
$km/\sqrt{W}$ edges, we expect $k^2$ wedges. So, if $k > 1/\epsilon$, we
get more than $1/\epsilon^2$ wedges. Since the reservoir contains a large set
of wedges, we could just use a subset of these for estimating
$\mathbf{E}[b_m]$. Unfortunately, these wedges are correlated with each
other, and we cannot theoretically prove the desired concentration.
In practice, it appears that these wedges are sufficiently
uncorrelated that we get excellent results by reusing
the reservoir. This is an important distinction from past
streaming work [19, 9]. We can multiply our space by $1/\epsilon$ to
(heuristically) get error $\epsilon$, but this is not possible through
previous algorithms. Their space is multiplied by $1/\epsilon^2$.

The second issue is that Single-Bit requires a fair bit of
bookkeeping. We need to generate a random wedge from the
large set $\mathcal{W}_t$. While this is possible by storing edge_res as a
subgraph, we have a nice (at least in the authors' opinion)
heuristic fix that avoids these complications.

Suppose we have a uniform random wedge $w \in \mathcal{W}_{t-1}$. We
can convert it to an "almost" uniform random wedge in $\mathcal{W}_t$.
If $\mathcal{W}_t = \mathcal{W}_{t-1}$ (so $\mathcal{N}_t = \emptyset$, which is true most
of the time), then $w$ is also uniform in $\mathcal{W}_t$. Suppose not.
Note that $\mathcal{W}_t$ is constructed by removing some wedges from
$\mathcal{W}_{t-1}$ and inserting $\mathcal{N}_t$. Since $w$ is uniform random in $\mathcal{W}_{t-1}$,
if $w$ is also present in $\mathcal{W}_t$, then it is uniform random in
$\mathcal{W}_t \setminus \mathcal{N}_t$. Replacing $w$ by a uniform random wedge in $\mathcal{N}_t$
with probability $|\mathcal{N}_t|/|\mathcal{W}_t|$ yields a uniform random wedge
in $\mathcal{W}_t$. This is precisely what Streaming-Triangles does.

When $w \notin \mathcal{W}_t$, the edge replaced by $e_t$ must be in
$w$. We approximate this as a low probability event and simply
ignore this case. Hence, in Streaming-Triangles, we
simply assume that $w$ is always in $\mathcal{W}_t$. Although this is technically
incorrect, it appears to have little effect on the accuracy in
practice, and it leads to a cleaner, more efficient implementation.

6.2 Proving Thm. 1 

We begin with some preliminaries. First, the set $\mathcal{R}_t$ is a set
of $s$ uniform i.i.d. samples from $\{e_1, e_2, \ldots, e_t\}$, a direct consequence
of reservoir sampling. Next, we define future-closed
wedges. Take the final graph $G$ and label all edges with their
timestamp. For each triangle, the wedge formed by the earliest
two timestamps is a future-closed wedge. In other words,
if a triangle has edges $e_i, e_j, e_k$ ($i < j < k$), then the
wedge $\{e_i, e_j\}$ is future-closed. The number of future-closed
wedges is exactly $T$, since each triangle contains a single
such wedge. We have a simple yet important claim about
Single-Bit.



Claim 1. The set $\mathcal{C}_m$ is exactly the set of future-closed
wedges in $\mathcal{W}_m$.

Proof. Consider some wedge $\{e_i, e_j\}$, $i < j$, in $\mathcal{W}_m$. This
wedge was formed at time $j$, and remains in all $\mathcal{W}_t$ for $j \le t \le m$.
If this wedge is future-closed (say by edge $e_{t'}$, for
$t' > j$), then at time $t'$, the wedge will be detected to be
closed. Since this information is maintained by Single-Bit,
the wedge will be in $\mathcal{C}_m$. If the wedge is not future-closed,
then no closing edge will be found for it after time $j$. Hence,
it will not be in $\mathcal{C}_m$. □
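
The definition of future-closed wedges, and the fact that there is exactly one per triangle, can be checked by brute force on a small timestamped edge list. The sketch below is purely illustrative (cubic in the number of vertices); the function name is an assumption, and a wedge is identified by the pair of timestamps of its two edges.

```python
from itertools import combinations


def future_closed_wedges(stream):
    """Return the future-closed wedges of a list of distinct edges,
    each wedge given as the pair (i, j) of timestamps of its two edges."""
    time_of = {frozenset(e): t for t, e in enumerate(stream, start=1)}
    vertices = list({v for e in stream for v in e})
    result = set()
    for a, b, c in combinations(vertices, 3):
        sides = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if all(x in time_of for x in sides):
            times = sorted(time_of[x] for x in sides)
            # The two earliest edges of the triangle form its future-closed wedge.
            result.add((times[0], times[1]))
    return result
```

For the complete graph on four vertices, which has four triangles, the function returns four distinct wedges regardless of the edge order.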



The main technical effort goes into showing that the number
of wedges formed by edges in $\mathcal{R}_m$ (this is precisely $|\mathcal{W}_m|$)
can be used to determine the actual number of wedges in
$G_m$. Furthermore, the number of future-closed wedges in
$\mathcal{R}_m$ (precisely $|\mathcal{C}_m|$, by Claim 1) can be used to estimate $T$.

This is formally expressed in the next lemma. Roughly, if
$s = km/\sqrt{W}$, then we expect $k^2$ wedges formed by $\mathcal{R}_m$.
We also get weak concentration bounds for the quantity. A
similar bound (with somewhat weaker concentration) holds
even when we consider the set of future-closed wedges.

Lemma 1 (Birthday paradox for wedges). Let $G$ be
a graph with $m$ edges and $\mathcal{S}$ be some fixed subset of wedges in
$G$. Let $\mathcal{R}$ be a set of $s$ uniformly and independently selected
edges (with replacement) from $G$. Let $X$ be the random variable
denoting the number of wedges in $\mathcal{S}$ formed by edges in
$\mathcal{R}$.

1. $\mathbf{E}[X] = \binom{s}{2}(2|\mathcal{S}|/m^2)$.

2. Let $\beta > 0$ be a parameter and $c$ be a sufficiently large
constant. Assume $W \ge m$. If $s \ge cm/(\beta^3\sqrt{W})$,
then with probability at least $1 - \beta$,
$|X - \mathbf{E}[X]| \le (\beta W/|\mathcal{S}|)\,\mathbf{E}[X]$.

Proof. The first part is an adaptation of the birthday
paradox calculation. Let the set $\mathcal{R} = \{r_1, r_2, \ldots, r_s\}$. We
define random variables $X_{i,j}$ for each $i, j \in [s]$ with $i < j$.
Let $X_{i,j} = 1$ if the wedge $\{r_i, r_j\}$ belongs to $\mathcal{S}$ and
0 otherwise. Then $X = \sum_{i<j} X_{i,j}$.

Since $\mathcal{R}$ consists of uniform i.i.d. edges from $G$, the following
holds: for every $i < j$ and every (unordered) pair
of edges $\{e_\alpha, e_\beta\}$ from $G$, $\Pr[\{r_i, r_j\} = \{e_\alpha, e_\beta\}] = 2/m^2$.
This implies $\Pr[X_{i,j} = 1] = 2|\mathcal{S}|/m^2$. By linearity of expectation,
we have $\mathbf{E}[X] = \binom{s}{2}\mathbf{E}[X_{1,2}] = \binom{s}{2}\Pr[X_{1,2} = 1]
= \binom{s}{2}(2|\mathcal{S}|/m^2)$, as required.
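
The expectation in the first part can be verified exactly on a toy graph by enumerating every ordered sample of $s$ edges drawn with replacement. The sketch below takes $\mathcal{S}$ to be the set of all wedges and uses exact rational arithmetic; the function names are assumptions introduced for this check, and the enumeration is only feasible for very small $m$ and $s$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def exact_expected_wedges(edges, s):
    """Average number of wedge-forming slot pairs over all m**s ordered samples."""
    es = [frozenset(e) for e in edges]
    total = 0
    for sample in product(es, repeat=s):
        # Two sampled edges form a wedge iff they share exactly one vertex.
        total += sum(len(a & b) == 1 for a, b in combinations(sample, 2))
    return Fraction(total, len(es) ** s)


def lemma_expectation(num_wedges, m, s):
    """The closed form binom(s, 2) * 2|S| / m^2 from part 1 of Lem. 1."""
    return Fraction(comb(s, 2) * 2 * num_wedges, m ** 2)
```

For a triangle with one pendant edge ($m = 4$, five wedges) and $s = 3$, both functions agree on the value 15/8.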



The second part is obtained by applying the Chebyschev 
inequality. Let yar[X] denote the variance of X. For any 
/i > 0, 



Pr[|X - E[X]\ >h]< Var[X]/h'^ 



(1) 



We need an upper bound on the variance of X to apply 
(1). The proof of the following bound on yar[X] is fairly 
technical and deferred to ^6.3. 



set of future closed wedges. The number of future closed 
wedges is T, so Lem. 1 tells us that E[|Cm|] = {l){2T/m'^). 
Hence, E[|C^|]/E[|Wm|] = T/W = 

In general, the value of |Cm|/|Wm| might be quite differ- 
ent from E[|Cm|]/E[|Wm|]. Note that E[|Cm|] > c V/^^ by 
choice of s. Using the concentration bounds of Lem. 1, we 
can argue that the deviation of | Cm | / 1 Wm | from E [ | Cm | ] /E [ | Wm | ] 
is at most /3 with probability > 1 — /3. 

6.3 Bounding the variance: proof of Lem. 2 

Proof. We use the same notation as in the proof of 
Lem. 1. For convenience, we set /a = E[Xi,j], which is 
2|<S|/m^. By the definition of variance and linearity of ex- 
pectation, 

Var[X] = B[X^] - (E[X])^ 

i<j p<q i<j,p<q 

The summation is split as follows. 

i<j,p<q 

^'ElXlj]^ E '^[^i,J^P,q]+ E '^[^iJ^P,< 



i<j 



i<j,p<q 

\{i,j}n{p,q}\^^ 



^<3,p<q 
{*,j}n{p,g}= 



Lemma 2 (Variance bound). Assuming W >m and 
s > m/VW, Var[X] < cs^W^^'^/m^ (for sufficiently large 
constant d ). 



We deal with each of these terms separately. For conve- 
nience, we refer to the terms (in order) as ^1,^2, A3. We 
first list the upper bounds for each of these terms and derive 
the final bound on yar[X]. 



Weset/i= {^W l\S\)E\X\. Note that E[X] = s(s-l)|<S|/r 
By (1), Pr[|X - E[X]| > h\ is at most the following. 



h2 - /3^w^(E[X])y\S\^ 



/3^VWm^{s{s - l)|<S|/m2)2 

^ cVnA^W) <^ 
s 

where the final inequality holds by the bound on s. □ 



We sketch the proof of Thm. 1. (Details in §6.4.) At the 
end of the stream, the output bit bm is 1 if |Wm| > and a 
wedge from Cm is sampled. Note that both |Wm| and \Cm\ 
are random variables. 

To deal with the first event, we apply Lem. 1 with S being 
the set of all wedges. So, E[|Wm|] = {'^){2W/m^). Since 
s > cm/{l3^VT) > cm/{^^VW), E[|Wm|] > c V/^^ (a fairly 
large number). Intuitively, the probability that |Wm| = 
should be very small, and this can be bounded using the 
concentration bound of Lem. 1. 

For this informal discussion, let us just assume that |Wm| > 
happens. Now, the probability that bm — ^ (which is 
E[6m]) is exactly the fraction |Cm|/|Wm|. Suppose both 
these behave like their expectation. By Claim 1, Cm is the 
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We shall prove these shortly. From these, we directly bound 
Var[X]. 



Var[X]^Ai + A2 + As 



2 

s \ 2 



2 [ S\ 2 



Note that 6(4) = s(s-l)(s-2)(s-3)/4 < [s(s-l)/2]^ Since 
the ^3-norm is less than the ^2-norm, dl < (J^^ d^)^^^ . 

Since > m, we have J^v^l = ^J2v (t) + 2m < 4W. 
Plugging these bounds in (and using gross upper bounds to 
ignore constants). 



Var[X] < s^^i + 2& 



^ 2s^\S\ ^ 16g^V^^/^ 

^ 2s^W IQs^W^^^ ^ ISs^W^^^ 
^ 2- + 3 ^ 3 

The final step uses the fact that s > m/VW. □ 



We now bound the terms Ai, A2, and A3 in three separate 
claims. 

Claim 2. E.<,- = ©M- 

Proof. Since Xij only takes the values and 1, E[Xf = 



Claim 3. 



J2 nx^,JX,,,] < 12 



i<j,p<q 

\{i,j}n{p,q}\=^l 



/ vein] 



Proof. How many terms are in the summation? There 
are 3 distinct indices in {i,j,p,q}. For each distinct triple 
of indices, there are 6 possible way of choosing i < j,p < q 
among these indices such that H {p, ^}| = 1. This 

gives 6(3) terms. By symmetry each term in the summation 
is equal to E[Xi, 2-^1,3]- This is exactly the probability that 
{ri,r2} G S and {ri^rs} G S. We bound this probability 
above by 2^^^^^^ dl/m^, completing the proof. 

Let 8 be the event {ri,r2} G S and {ri^rs} G S. Let 
be the event that edge ri intersects edges r2 and rs. Ob- 
serve that 8 implies J^. Therefore, it suffices to bound the 
probability of the latter event. We also use the inequality 
Va,6, (a + 5)2 < 2{a^^b^). 

Pr[{ri,r2}G5, {n.rs} e S] 
< Pr[ri nr2 / 0, n n rs / 0] 

= XI v}nr2y^ 0, {u, v}nr3y^ 0|ri = {u, v)] • — 



(u,v)eE 



< 



2 ^ [du + dv) 

2 ^v^Wi] 



{u,v)eE 

J3 



For the final equality, consider the number of terms in the 
summation where d^ appears. This is the number of edges 
{u,v) (over all v), which is exactly du- □ 



6.4 Proof of Thm.l 

Proof. As mentioned in the proof sketch, the output bit 
bm is 1 if |Wm| > and a wedge from Cm is sampled. For 
convenience, we will set Y = |Wm| and Z = \Cm\- Note that 
Y is the total number of wedges formed by edges in 1Zm- 
And Z is the the number of future closed wedges formed 
by edges in IZm- Both Y and Z are random variables that 
depend on Tim- Condition on some Tim generated by the 
Single-Bit. If \Wm\ > 0, then the expectation of bm is 
exactly Z/Y. The next claim extends this. 



Claim 5. Let 8 be an event such that 8 implies y > 
and Fr[8] is at least 1-/3'. Then \E[bm] - E[Z/Y\8]\ < I3\ 

Proof. Let T denote the event y > 0. Since Single- 
Bit outputs when T does not hold, we get E[6rn|-^] = 0. 
Since E[bm] = E[6^|J^] Pr[J^] + E[6^| J] Pr[J], E[bm] = 
E [6rn I Pr[J^]. Further observe that E[6m|-^] is exactly 
equaltoE[Z/y|J^]. Therefore, we get E [6^] = E[Z/y | J"] Pr[J^]. 
Further conditioning the expectation on the event 8, we get 

E[6^] = E[Z/y| J-] Pr[J-] 

= {E[Z/Y\J^n8] PT[8\J^]+E[Z/Y\J^n8] Pr[F| J^]) • Pr[J^] 

= E[z/y I J- n 8] Pt[8 n j^] + e[z/y\j' n 8] Pi[8 n J*] 

= E[Z/Y\8] Pt[8] + E[Z/Y\J' n 8] Pt[8 n J"] (2) 

The second last equality uses the fact that Pt[A n B] = 
Pi[A\B] • Pr[B], while the last equality uses the fact that 
T n 8 = 8 (since 8 implies T). Note that Z/Y < 1. 
Thus, (2) is at least E[Z/Y\8] Pt[8] > E[Z/y|^](l - /3')_> 
E[Z/Y\8]-/3\ Moreover, (2) is at most E[Z/y |^] + Pr[^ n 
J=-] < E[Z/Y\8] + Pr[^] < E[Z/Y\8] + /3^ □ 



Henceforth, we focus on the quantity $E[Z/Y \mid \mathcal{E}]$ (for an
appropriately chosen $\mathcal{E}$ defined shortly). The overall strategy
is to show that this is close to $E[Z]/E[Y]$. This requires
bounds for deviations of $Y$, $Z$ from their corresponding
expectations.

First, we determine $E[Y]$ and $E[Z]$ from Lem. 1. Applying
the lemma with $S$ as the set of all wedges in $G$, we obtain
$E[Y] = \binom{s}{2} \cdot 2W/m^2 = s(s-1)W/m^2$. Analogously, with $S$ as
the set of all future closed wedges, we get $E[Z] = \binom{s}{2} \cdot 2T/m^2$.
(Recall that the number of future closed wedges in $G$ is $T$.)
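
The expression for $E[Y]$ can be sanity-checked by exact enumeration on a toy graph. The sketch below assumes a simple undirected graph stored as a list of vertex pairs and assumes that the $s$ edges are drawn independently and uniformly with replacement, as in the analysis. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def count_wedges(edges):
    """W: number of unordered pairs of distinct edges sharing exactly one endpoint."""
    w = 0
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            if len(set(edges[i]) & set(edges[j])) == 1:
                w += 1
    return w


def wedge_pair_probability(edges):
    """Exact probability that two independent uniform edge draws form a wedge."""
    m = len(edges)
    hits = sum(1 for e, f in product(edges, repeat=2)
               if len(set(e) & set(f)) == 1)
    return Fraction(hits, m * m)


def expected_wedges(s, W, m):
    """E[Y] = C(s, 2) * 2W / m^2 = s(s - 1)W / m^2."""
    return Fraction(comb(s, 2) * 2 * W, m * m)


example = [(0, 1), (1, 2), (0, 2), (2, 3)]
W = count_wedges(example)
# Each unordered wedge is hit by two ordered pairs of draws, so this equals 2W / m^2.
print(W, wedge_pair_probability(example), expected_wedges(3, W, len(example)))
```

Summing the per-pair probability $2W/m^2$ over the $\binom{s}{2}$ pairs of draws, by linearity of expectation, gives the stated value of $E[Y]$. The same computation with $T$ in place of $W$ gives $E[Z]$.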



Claim 4.

$$\sum_{\substack{i<j,\ p<q \\ \{i,j\} \cap \{p,q\} = \emptyset}} E[X_{i,j} X_{p,q}] = 6\binom{s}{4}\mu^2$$

Proof. There are $6\binom{s}{4}$ terms in the summation ($\binom{s}{4}$ ways
of choosing $\{i,j,p,q\}$ and 6 different orderings). Note that
$X_{i,j}$ and $X_{p,q}$ are independent, regardless of the structure
of $G$ or the set of wedges $S$. This is because
$\Pr[X_{i,j} = 1 \mid X_{p,q} = 0] = \Pr[X_{i,j} = 1 \mid X_{p,q} = 1] = \Pr[X_{i,j} = 1]$. In
other words, the outcomes of the random edges $r_p$ and $r_q$ do
not affect the edges $r_i$, $r_j$ (by independence of these draws)
and hence cannot affect the random variable $X_{i,j}$. Thus,
$E[X_{i,j} X_{p,q}] = E[X_{i,j}] E[X_{p,q}] = \mu^2$. □
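
The count of terms in Claim 4 can be confirmed by brute force for small $s$. This is a minimal sketch with 0-based indices. The function name is illustrative.

```python
from itertools import combinations
from math import comb


def count_disjoint_pair_terms(s):
    """Count ordered pairs ((i, j), (p, q)) with i < j, p < q and {i, j}, {p, q} disjoint."""
    pairs = list(combinations(range(s), 2))
    return sum(1 for a in pairs for b in pairs if not set(a) & set(b))


for s in range(2, 9):
    # Each 4-subset of indices splits into two pairs in 3 ways, each counted in 2 orders.
    assert count_disjoint_pair_terms(s) == 6 * comb(s, 4)
```

The factor 6 arises because a 4-element index set can be partitioned into two pairs in 3 ways, and the summation distinguishes which pair plays the role of $\{i,j\}$ and which plays $\{p,q\}$.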



Next we apply the second part of Lem. 1 to random variables
$Y$ and $Z$ with the parameter $\beta$ in the statement of
the lemma set to (the smaller quantity) $\beta' = \beta/5$. We
can do this because by the premise of the theorem,
$s \ge cm/(\beta^3\sqrt{T}) \ge cm/(\beta^3\sqrt{W})$. Moreover, for large enough
constant $c$, the latter is at least $c'm/(\beta'^3\sqrt{W})$, as required
in order to apply the second part of Lem. 1. Therefore, with
probability $> 1 - \beta'$, $|Y - E[Y]| \le \beta' E[Y]$. (This suffices to
prove the second part of Thm. 1, since $est = m^2 Y/s(s-1)$
and $m^2 E[Y]/s(s-1) = W$.) With probability $> 1 - \beta'$,
$|Z - E[Z]| \le (\beta' W/T) E[Z]$. Note that $(\beta' W/T) E[Z] = \beta' E[Y]$.

For convenience, let $\mathcal{E}$ denote the event

$$\max(|Y - E[Y]|,\ |Z - E[Z]|) \le \beta' E[Y].$$



Table 1: Properties of the graphs used in the experiments

Graph            n        m          W         T          κ
amazon0312       401K     2350K      69M       3686K      0.160
amazon0505       410K     2439K      73M       3951K      0.162
amazon0601       403K     2443K      72M       3987K      0.166
as-skitter       1696K    11095K     16022M    28770K     0.005
cit-Patents      3775K    16519K     336M      7515K      0.067
DBLP             317K     1049K      21M       224K       0.3064
roadNet-CA       1965K    2767K      6M        121K       0.060
web-BerkStan     685K     6649K      27983M    64691K     0.007
web-Google       876K     4322K      727M      13392K     0.055
web-NotreDame    326K     1090K      305M      8910K      0.088
web-Stanford     282K     1993K      3944M     11329K     0.009
wiki-Talk        2394K    4660K      12594M    9204K      0.002
youtube          1158K    2990K      1474M     3057K      0.006
flickr           1861K    15555K     14670M    548659K    0.112
livejournal      5284K    48710K     7519M     310877K    0.124
orkut            3073K    223534K    45625M    627584K    0.041


[Figure 2 plots. Panels: (a) Transitivity, (b) Number of triangles. x-axis (stream ordering): Random1, Random2, Bfs, Dfs, Degree sorted.]



Figure 2: Output of a single run of Streaming-Triangles 
on various orderings of the data stream for web-NotreDame. 
The true value is given by the thick black line. 



By the union bound on probabilities, $\Pr[\mathcal{E}] \ge 1 - 2\beta'$. By
choice of $s$, $E[Y] \ge c'^2/\beta'^6$. Hence, when $\mathcal{E}$ happens, $Y > 0$.
(In other words, with probability at least $1 - 2\beta'$, the edges
in $\mathcal{R}_m$ will form a wedge.)

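Now look at $E[Z/Y \mid \mathcal{E}]$. From the expectation bound, we
know that $E[Z]/E[Y] = T/W = \kappa/3$. When $\mathcal{E}$ occurs, $Z/Y$
is not far from this value. Formally, when $\mathcal{E}$ occurs,

$$\frac{E[Z] - \beta' E[Y]}{(1 + \beta') E[Y]} \le \frac{Z}{Y} \le \frac{E[Z] + \beta' E[Y]}{(1 - \beta') E[Y]}$$

Manipulating the upper bound (for small enough $\beta'$),

$$\frac{E[Z] + \beta' E[Y]}{(1 - \beta') E[Y]} \le (1 + 2\beta') \frac{E[Z]}{E[Y]} + 2\beta' \le \frac{\kappa}{3} + 4\beta'$$

Using a similar calculation for the lower bound, when $\mathcal{E}$
occurs, $|Z/Y - \kappa/3| \le 4\beta'$. This implies
$|E[Z/Y \mid \mathcal{E}] - \kappa/3| \le 4\beta'$.
Using Claim 5, we get $|E[b_m] - \kappa/3| \le 5\beta' = \beta$.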
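
The manipulation of the upper bound can be expanded step by step. The following worked derivation is a sketch that assumes $\beta' \le 1/2$ as one concrete reading of "small enough", and uses $E[Z]/E[Y] = \kappa/3$ together with $\kappa \le 1$.

```latex
\begin{align*}
\frac{E[Z] + \beta' E[Y]}{(1-\beta')E[Y]}
  &= \frac{\kappa/3 + \beta'}{1-\beta'} \\
  &\le (1+2\beta')\left(\frac{\kappa}{3} + \beta'\right)
     && \text{since } \tfrac{1}{1-\beta'} \le 1 + 2\beta' \text{ for } \beta' \le \tfrac{1}{2} \\
  &= (1+2\beta')\frac{\kappa}{3} + \beta' + 2\beta'^2 \\
  &\le (1+2\beta')\frac{\kappa}{3} + 2\beta'
     && \text{since } 2\beta'^2 \le \beta' \text{ for } \beta' \le \tfrac{1}{2} \\
  &\le \frac{\kappa}{3} + 4\beta'
     && \text{since } \tfrac{2\beta'\kappa}{3} \le 2\beta'.
\end{align*}
```

The lower bound follows in the same way from $1/(1+\beta') \ge 1 - \beta'$.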

7. EXPERIMENTAL RESULTS 

We implemented our algorithm in C++ and ran our experiments
on a MacBook Pro laptop equipped with a 2.8GHz
Intel Core i7 processor and 8GB memory. The vital statistics
of all the graphs that we experiment upon are provided in
Tab. 1. For our case studies, we focus on the web-NotreDame
and amazon0505 graphs, which have roughly 1.1M and 2.4M edges
respectively.
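
The transitivity values in Tab. 1 can be recomputed from the wedge and triangle counts using $\kappa = 3T/W$. The sketch below does this for the two case-study graphs, using the rounded counts as printed in the table. The dictionary and function names are illustrative.

```python
def transitivity(triangles, wedges):
    """kappa = 3T / W."""
    return 3 * triangles / wedges


# (T, W) as printed in Tab. 1, expanded from K/M suffixes (rounded values).
TAB1_CASE_STUDIES = {
    "web-NotreDame": (8_910_000, 305_000_000),
    "amazon0505": (3_951_000, 73_000_000),
}

for name, (T, W) in TAB1_CASE_STUDIES.items():
    print(name, round(transitivity(T, W), 3))
```

Because the table entries are rounded, recomputed values agree with the printed transitivity only up to the last reported digit.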

Effects of storage on estimates: We explore the effect
that the parameters $s_e$, $s_w$ (referred to as the edge reservoir
and wedge reservoir sizes) have on the quality of the estimates
for $\kappa$ and $T$. Both of these graphs have between 1M and 3M
edges. Each of these is converted into a random stream of
edges.

We set $s_e$ to be 10K, 20K, and 40K, and vary $s_w$ from 10K
to 70K in increments of 10K. For each setting of the parameters,
we perform a single run of Streaming-Triangles
on these streamed graphs. This is to give a true indication
of the behavior of Streaming-Triangles. The results are
shown in Fig. 3 and Fig. 4, both for the transitivity and triangle
counts for web-NotreDame and amazon0505. We fix a
setting of $s_e$ and increase $s_w$. In terms of the theory, a larger
$s_e$ helps generate wedges that are closer to uniform. A larger
$s_w$ then helps to get a sharper estimate for the fraction of
future closed wedges.
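
To make the role of the edge reservoir concrete, the following is a minimal sketch of one way to maintain $s_e$ uniform edge samples over a stream. The update rule, in which each slot independently takes the $t$-th edge with probability $1/t$, is an assumption chosen so that the slots behave like independent uniform draws with replacement, as in the analysis of Sec. 6. It is not claimed to be the exact implementation used in the experiments, and the wedge reservoir is not modeled. All names are hypothetical.

```python
import random


def update_edge_reservoir(reservoir, edge, t, rng):
    """Process the t-th stream edge (t is 1-indexed).

    Assumed rule: every slot is replaced independently with probability 1/t,
    so after t edges each slot holds a uniform sample of the first t edges.
    """
    for i in range(len(reservoir)):
        if rng.random() < 1.0 / t:
            reservoir[i] = edge
    return reservoir


def sample_stream(stream, s_e, seed=0):
    """Run the assumed update rule over a whole stream and return the reservoir."""
    rng = random.Random(seed)
    reservoir = [None] * s_e
    for t, edge in enumerate(stream, start=1):
        update_edge_reservoir(reservoir, edge, t, rng)
    return reservoir


stream = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(sample_stream(stream, s_e=8, seed=1))
```

At $t = 1$ the replacement probability is 1, so every slot is filled by the first edge. With more slots, more pairs of sampled edges are available to form wedges, which is the effect the experiments vary through $s_e$.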

Observe that the algorithm is almost always within 10% of
the true answer. The statistical deviation is larger for edge
reservoirs of 10K and 20K, but it is quite small for 40K. Indeed,
for $s_e$ = 40K, the output is well within 5%. In general, these
figures show that the algorithm does not have wild statistical
fluctuations and is fairly well behaved as we increase the
space. Increasing the space gives a predictable improvement
in the estimate.

Comparison with previous work: The streaming algorithm
of Buriol et al. [9] was implemented and run on real
graphs. In general, they get fairly large error even with
storage of 100K edges. We provide comparisons with an
implementation of their algorithm for arbitrary streams (they
also have a version only for incidence streams, but we do not
compare with that). The basic sampling procedure involves
sampling a random edge and a random vertex and trying to
complete a triangle. This is repeated independently in parallel
to get an estimate for the number of triangles. Buriol et
al. provide various heuristics to speed up their algorithm, but
the core sampling procedure is what was described above.
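
The edge-plus-vertex sampling idea can be illustrated offline by enumerating every (edge, vertex) pair on a small graph. This sketch is a simplification under stated assumptions: it ignores the streaming constraint entirely, assumes the vertex is drawn uniformly from the $n - 2$ vertices other than the edge's endpoints, and uses hypothetical function names. It is not the implementation of [9].

```python
from fractions import Fraction


def build_adjacency(edges):
    """Adjacency sets of a simple undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangle_count(edges):
    """Exact T: every triangle is seen once from each of its three edges."""
    adj = build_adjacency(edges)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def successful_pairs(edges):
    """Number of (edge, vertex) pairs whose vertex is adjacent to both endpoints."""
    adj = build_adjacency(edges)
    hits = 0
    for a, b in edges:
        for v in adj:
            if v != a and v != b and a in adj[v] and b in adj[v]:
                hits += 1
    return hits


def success_probability(edges):
    """Chance that one uniform (edge, other-vertex) sample completes a triangle."""
    n = len(build_adjacency(edges))
    m = len(edges)
    return Fraction(successful_pairs(edges), m * (n - 2))


example = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(triangle_count(example), successful_pairs(example), success_probability(example))
```

Under this offline simplification the number of successful pairs is $3T$, so a single sample succeeds with probability $3T/(m(n-2))$. Plugging in the Tab. 1 values for amazon0505 gives roughly $1.2 \times 10^{-5}$, that is, about one success per 84K samples in expectation, which is in line with the estimate staying at zero for small sample counts.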

In Fig. 5, we compare runs of Streaming-Triangles with
their algorithm for our test graphs. For convenience, we just
refer to their algorithm as "Buriol et al.". The x-axis is the
space used, and the y-axis gives the triangle estimate. For
simplicity, we fix $s_e$ = 20K and increase $s_w$ in
Streaming-Triangles. For Buriol et al., we count the storage of a
single edge and vertex as a single unit of space. (We ignore the
extra two edges needed to complete the triangle, and simply
count the number of samples used.) While the previous
algorithm has estimates off by 50% or more,
Streaming-Triangles is much more accurate over all of its runs.
(Indeed, the figure is zoomed out so much that the statistical
fluctuations of Streaming-Triangles are barely visible.) For
amazon0505, the estimate given by Buriol et al. is zero until 80K
samples. We note that these results are consistent with those
given in [9].

Runs on various graphs: We run Streaming-Triangles 
on a variety of graphs obtained from the SNAP database 




(a) web-NotreDame: Transitivity (b) web-NotreDame: Triangles 

Figure 3: web-NotreDame detailed runs: We fix the edge reservoir, $s_e$, to 10K, 20K, and 40K. Then we increase the wedge
reservoir size, $s_w$, from 10K to 70K and track the output of Streaming-Triangles. We give plots for transitivity and triangle
counts.



[39]. We simply set $s_e$ to 20K and $s_w$ to 20K for all our
runs. Each graph is converted into a stream by taking a
random ordering on the edges. In Fig. 6, we show our results
for estimating both $\kappa$ and $T$. The absolute values are
plotted for $\kappa$ together with the true values. For triangles,
we plot the relative error (so $|est - T|/T$, where $est$ is the
algorithm output) for each graph, since the true values can
vary over orders of magnitude. Observe that the transitivity
estimates are very accurate. The relative error for $T$ is
mostly below 8%, and often below 4%.

All the graphs listed have millions of edges, so our storage is
always 2 orders of magnitude smaller than the graph. Most
dramatically, we get accurate results on the Orkut social
network, which has 220M edges. The algorithm stores only
30K edges, a 0.0001-fraction of the graph. Also observe the
results on the Flickr and LiveJournal graphs, which also
run into tens of millions of edges.

Effect of stream ordering: It is not possible to check
that our algorithm works for all orderings of the stream.
So we generate a set of different orderings of the edges in
web-NotreDame and run Streaming-Triangles on all the
orderings. The results are given in Fig. 2. As before, we fix the
edge and wedge reservoir to 20K. We discuss the different
orderings below.

We first have two random orderings. Next, we generate a
stream through a bfs as follows. We take a bfs tree from a
random vertex and list out all edges in the tree. Then, we
list the remaining edges in random order. Since the average
degree of web-NotreDame is around 3, the tree has about
one-third of the total edges. So this ordering is fairly different
from a random ordering. Our fourth ordering involves
taking a dfs from a random vertex and listing out the edges in
the order seen by the dfs. Finally, we sort the vertices by degree
and list all edges incident to a vertex (this is an incidence
stream).
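
The bfs ordering described above can be sketched as follows. This is an illustrative reading of the description, assuming a simple undirected graph given as a list of vertex pairs. The root is passed in explicitly rather than drawn at random, and the function name and the `seed` parameter are hypothetical.

```python
import random
from collections import deque


def bfs_stream(edges, root, seed=0):
    """Return the edges of a bfs tree from root first, then all other edges shuffled."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    seen = {root}
    queue = deque([root])
    tree = []
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree.append((u, v))  # tree edge, in discovery order
                queue.append(v)

    # Edges outside the tree (including any in other components) follow in random order.
    tree_set = {frozenset(e) for e in tree}
    rest = [e for e in edges if frozenset(e) not in tree_set]
    random.Random(seed).shuffle(rest)
    return tree + rest


example = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(bfs_stream(example, root=0))
```

For a connected graph the tree prefix has $n - 1$ edges, which is why a graph with about three times as many edges as vertices places roughly one-third of its edges in the tree prefix.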

Streaming-Triangles performs reasonably on all these 
different orderings. There is little deviation in the transitiv- 
ity values. There is somewhat more difference in the triangle 
numbers, but it never exceeds 10% relative error. This seems 
to be simply deviations in repeated runs of Streaming- 
Triangles. 

Real-time tracking: A major benefit of Streaming-Triangles
is that it can maintain a real-time estimate of $\kappa_t$ and $T_t$. We
take a real-world temporal graph, cit-Patents, which contains
patent citation data over decades. The edges are time
stamped with the year of citation, and hence give a stream
of edges. Using a larger edge reservoir of 40K
and a wedge reservoir of 15K, we accurately track these values
over time (refer to Fig. 1). Note that this is still orders
of magnitude smaller than the full size of the graph, which
is 16M edges. The errors of our estimates are overall quite
small. (The true values are only given for the year ends.)
Figure 5: Comparison of our algorithm with Buriol et al. [9] for amazon0505 and web-NotreDame. We use a 20K edge reservoir
and variable wedge reservoir for our storage.
Figure 6: Output of a single run of Streaming-Triangles on a variety of real datasets with 20K edge reservoir and 20K
wedge reservoir. The plot on the left gives the estimated transitivity values (labelled streaming) alongside their exact values.
The plot on the right gives the relative error of Streaming-Triangles's estimate on triangles $T$. Observe that the relative
error for $T$ is mostly below 8%, and often below 4%.